
Conversation

@ilicmarkodb (Contributor) commented on Oct 22, 2025

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

This PR extends getStatsSchema to include the collated stats schema for every collation referenced by the DataSkippingPredicate (schema example). It also adds E2E tests for collated data skipping.
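To make the change easier to picture, here is a rough sketch of what a collated stats schema fragment could look like, built with Kernel's StructType API. The per-collation grouping and field names below are illustrative assumptions only; the layout the PR actually emits is the one in the linked schema example.

```java
// Illustrative only: the grouping of min/max columns under a collation key is an
// assumed layout, not the schema the PR actually produces.
import io.delta.kernel.types.StringType;
import io.delta.kernel.types.StructType;

public class CollatedStatsSchemaSketch {
  public static void main(String[] args) {
    // Collated min/max entries are kept only for string columns (here: "c1").
    StructType collatedMinMax = new StructType().add("c1", StringType.STRING);

    // Hypothetical stats schema fragment for a predicate that references
    // SPARK.UTF8_LCASE: min/max values grouped under that collation.
    StructType collatedStats =
        new StructType()
            .add(
                "SPARK.UTF8_LCASE",
                new StructType()
                    .add("minValues", collatedMinMax)
                    .add("maxValues", collatedMinMax));

    System.out.println(collatedStats);
  }
}
```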

How was this patch tested?

New tests.

Does this PR introduce any user-facing changes?

No.

Comment on lines -30 to -35
/**
* Set of {@link CollationIdentifier}s referenced by this predicate or any of its child
* expressions
*/
private final Set<CollationIdentifier> collationIdentifiers;

@ilicmarkodb (Contributor, Author):

I don't think we should have this field. We already have collationIdentifier in Predicate, so having this just increases complexity. Also, getReferencedCollations is only called once in the codebase, so we don't lose much by not persisting its value.
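A minimal sketch of computing the referenced collations on demand, as suggested above. The import paths and accessors (getCollationIdentifier(), getChildren()) are assumptions based on this discussion and the Kernel expression API, not a verbatim copy of it.

```java
// Sketch: collect collations by walking the expression tree when needed, instead of
// caching them in a field on DataSkippingPredicate. Accessor names and import paths
// are assumptions based on the PR discussion.
import java.util.HashSet;
import java.util.Set;

import io.delta.kernel.expressions.Expression;
import io.delta.kernel.expressions.Predicate;
import io.delta.kernel.types.CollationIdentifier;

public final class CollationCollector {
  /** Recursively collects every collation referenced by {@code expr} or its children. */
  public static Set<CollationIdentifier> getReferencedCollations(Expression expr) {
    Set<CollationIdentifier> result = new HashSet<>();
    if (expr instanceof Predicate) {
      // Assumed accessor: a collated Predicate exposes its (optional) collation.
      ((Predicate) expr).getCollationIdentifier().ifPresent(result::add);
    }
    for (Expression child : expr.getChildren()) {
      result.addAll(getReferencedCollations(child));
    }
    return result;
  }

  private CollationCollector() {}
}
```

Since this walk only happens once per scan, recomputing it is cheap and avoids keeping a redundant field in sync with the predicate tree.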

@allisonport-db (Collaborator) left a comment:

A few minor comments/questions on the source code. I'll review the tests in a bit.

Comment on lines +301 to +305
/**
* Given a data schema and a set of collation identifiers returns the expected schema for
* collation-aware statistics columns. This means 1) replace logical names with physical names 2)
* set nullable=true 3) only keep collated-stats eligible fields (`StringType` fields)
*/
@allisonport-db (Collaborator):

Is this what we do in delta spark for the stats read schema? Or is there an optimization where we only read the collated stats for columns referenced in the predicate (with that collation)?

Maybe this is a future optimization we could consider?

@ilicmarkodb (Contributor, Author):

Yes, we can optimize it like that. We can do this for both collated and binary stats. Let's leave this for future optimization so we don't complicate this PR further.
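For context on the Javadoc quoted at the top of this thread, here is a minimal sketch of the described transformation (physical names, nullable=true, keep only StringType fields). The physical-name lookup is abstracted behind a caller-supplied function because the real column-mapping plumbing is not shown here, and the per-collation grouping from the actual getStatsSchema change is omitted.

```java
// Sketch of the pruning the quoted Javadoc describes: keep only StringType leaves,
// rename them to physical names, and force nullable = true. The physicalNameOf
// function is a stand-in for the column-mapping lookup used by the real code.
import java.util.function.Function;

import io.delta.kernel.types.StringType;
import io.delta.kernel.types.StructField;
import io.delta.kernel.types.StructType;

public final class CollatedStatsSchemaPruner {
  public static StructType toCollatedStatsSchema(
      StructType dataSchema, Function<StructField, String> physicalNameOf) {
    StructType result = new StructType();
    for (StructField field : dataSchema.fields()) {
      if (field.getDataType() instanceof StructType) {
        // Recurse into nested structs; keep them only if they still contain
        // collated-stats eligible (string) leaves.
        StructType nested =
            toCollatedStatsSchema((StructType) field.getDataType(), physicalNameOf);
        if (!nested.fields().isEmpty()) {
          result = result.add(physicalNameOf.apply(field), nested, true /* nullable */);
        }
      } else if (field.getDataType() instanceof StringType) {
        result = result.add(physicalNameOf.apply(field), field.getDataType(), true);
      }
      // Non-string leaves are dropped: they are not eligible for collated stats.
    }
    return result;
  }

  private CollatedStatsSchemaPruner() {}
}
```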

@allisonport-db (Collaborator) left a comment:

Just a few minor comments, plus please fix the failing tests. Afterward, LGTM.

}
}

test("partition and data skipping - combined pruning on partition and data columns") {
@allisonport-db (Collaborator):

nit: add a collation description to the test title

}
}

test("data skipping - predicates with SPARK.UTF8_BINARY on data column") {
@allisonport-db (Collaborator):

Can you add a test case over a nested string column?

test("pruneStatsSchema - collated min/max columns") {
val utf8Lcase = CollationIdentifier.fromString("SPARK.UTF8_LCASE")
val unicode = CollationIdentifier.fromString("ICU.UNICODE")
val testSchema = new StructType()
@allisonport-db (Collaborator):

nit: I think you could define reusable vals to make this simpler, e.g.:

val nestedField = ...
val s1Field = ...
val allFields = ... // (s1 + i1 + i2 + nested)
